[Feature #21943] Add `StringScanner#integer_at` by jinroq · Pull Request #193 · ruby/strscan

jinroq · 2026-03-06T08:28:39Z

The specification for MatchData#integer_at has been defined here. StringScanner#integer_at follows this specification.

see: https://bugs.ruby-lang.org/issues/21932#note-6, https://bugs.ruby-lang.org/issues/21932#note-7
see: https://bugs.ruby-lang.org/issues/21943

jinroq · 2026-03-06T08:29:47Z

ext/strscan/strscan.c

    return new_ary;
 }

+#ifdef HAVE_RB_INT_PARSE_CSTR


see: ruby/ruby#16322

kou · 2026-03-06T08:59:25Z

.github/workflows/ci.yml

 on:
 - push
 - pull_request
+- workflow_dispatch


Suggested change

- workflow_dispatch

9e0d504 fixed it.

kou · 2026-03-06T09:03:39Z

ext/strscan/extconf.rb

  have_func("onig_region_memsize(NULL)")
  have_func("rb_reg_onig_match", "ruby/re.h")
  have_func("rb_deprecate_constant")
+  have_func("rb_int_parse_cstr")


strscan requires Ruby 2.4 or later.
What is the minimum Ruby version to use rb_int_parse_cstr()?

rb_int_parse_cstr has been available since Ruby 2.5.0. have_func detects whether it exists; where it does not (as on Ruby 2.4), the code falls back to rb_str_to_inum.

OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?

kou · 2026-03-06T09:04:22Z

ext/strscan/strscan.c

+#ifdef HAVE_RB_INT_PARSE_CSTR
+VALUE rb_int_parse_cstr(const char *str, ssize_t len, char **endp,
+                        size_t *ndigits, int base, int flags);
+#define RB_INT_PARSE_SIGN 0x01


If ruby/ruby#16322 is merged, this will report a duplicated definition warning.

1630df8 fixed it.

Can we omit rb_int_parse_cstr() prototype and RB_INT_PARSE_SIGN definition entirely when Ruby provides them?

kou · 2026-03-06T09:05:24Z

ext/strscan/strscan.c

    rb_define_method(StringScanner, "size",        strscan_size,        0);
    rb_define_method(StringScanner, "captures",    strscan_captures,    0);
    rb_define_method(StringScanner, "values_at",   strscan_values_at,  -1);
+    rb_define_method(StringScanner, "integer_at",     strscan_integer_at,     1);


Suggested change

rb_define_method(StringScanner, "integer_at",     strscan_integer_at,     1);

rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);

1630df8 fixed it.

eregon · 2026-03-06T18:14:07Z

If https://bugs.ruby-lang.org/issues/21932 gets merged it seems cleaner to reuse that than to reimplement it.

@kou Do you know why StringScanner has an interface very similar to MatchData but yet doesn't expose the MatchData object?
In fact on TruffleRuby StringScanner uses MatchData objects internally.

I think it would be better to expose the MatchData object than keep defining methods similar to MatchData but with slightly different names. I think it makes it harder to learn the StringScanner API (i.e., it would be smaller and easier to approach if it didn't duplicate many MatchData methods).

Among all StringScanner instance methods:

  <<, [], beginning_of_line?, captures, charpos, check, check_until,
  concat, eos?, exist?, fixed_anchor?, get_byte, getch, initialize_copy,
  inspect, match?, matched, matched?, matched_size, named_captures,
  peek, peek_byte, pointer, pointer=, pos, pos=, post_match, pre_match,
  reset, rest, rest_size, scan, scan_byte, scan_integer, scan_until,
  size, skip, skip_until, string, string=, terminate, unscan, values_at

These are just doing the same on the MatchData:

  [], captures
  matched, matched?,
  matched_size (same as `byteend(0) - bytebegin(0)`), named_captures
  post_match, pre_match
  size, string, values_at

And these are MatchData methods which StringScanner doesn't have:

  begin, bytebegin, byteend, byteoffset, deconstruct,
  deconstruct_keys, end, length, match,
  match_length, names, offset,
  regexp, to_a

eregon · 2026-03-06T18:54:23Z

> If https://bugs.ruby-lang.org/issues/21932 gets merged it seems cleaner to reuse that than to reimplement it.

Mmh, but that likely wouldn't achieve as good a speedup as the current approach in the context of https://bugs.ruby-lang.org/issues/21943 as it would mean an extra MatchData allocation.
The strscan extension seems to save the matched captures but not a MatchData object:

strscan/ext/strscan/strscan.c

Lines 57 to 58 in 3592c39

    
    /* the regexp register; legal only when MATCHED_P(s) */
    struct re_registers regs;

BTW the presence of StringScanner.must_C_version makes me wonder, was StringScanner once written in Ruby?

kou · 2026-03-07T05:14:49Z

test/strscan/test_stringscanner.rb

+  def test_integer_at_large_number
+    huge = '9' * 100
+    s = create_string_scanner(huge)
+    s.scan(/(#{huge})/)


Suggested change

s.scan(/(#{huge})/)

s.scan(/(\d+)/)

kou · 2026-03-07T05:17:22Z

test/strscan/test_stringscanner.rb

+  end
+
+  def test_integer_at_leading_zeros
+    s = create_string_scanner("007")


007 is not good test data for this because it is valid in both base 10 and base 8. Do we need this test?

kou · 2026-03-07T05:18:34Z

test/strscan/test_stringscanner.rb

+    # "09" would be invalid in octal, but integer_at always uses base 10
+    s = create_string_scanner("09")
+    s.scan(/(\d+)/)
+    assert_equal(9, s.integer_at(1))
+
+    # "010" is 8 in octal (Integer("010")), but 10 in base 10
+    s = create_string_scanner("010")
+    s.scan(/(\d+)/)
+    assert_equal(10, s.integer_at(1))


Do we need both of them? Can they catch different problems?
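To illustrate the distinction these tests target, here is a minimal Ruby sketch contrasting fixed base-10 parsing (`String#to_i`) with prefix-aware parsing (`Kernel#Integer`). The helper name `prefix_aware` is hypothetical; it only wraps `Integer` so that invalid input yields `nil` instead of raising.

```ruby
# Hypothetical helper (not part of strscan): prefix-aware parsing that
# returns nil instead of raising on invalid input.
def prefix_aware(text)
  Integer(text)
rescue ArgumentError
  nil
end

# Print both interpretations side by side for the inputs discussed above.
%w[007 09 010].each do |text|
  puts "#{text}: to_i=#{text.to_i.inspect} Integer()=#{prefix_aware(text).inspect}"
end
```

"007" parses to the same value either way, so it cannot tell the two behaviors apart; "09" and "010" each can, in different ways (one is rejected as octal, the other yields a different value).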

kou · 2026-03-07T05:25:18Z

ext/strscan/extconf.rb

  have_func("onig_region_memsize(NULL)")
  have_func("rb_reg_onig_match", "ruby/re.h")
  have_func("rb_deprecate_constant")
+  have_func("rb_int_parse_cstr")


OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?

kou · 2026-03-07T05:26:36Z

ext/strscan/strscan.c

+        long j = 0;
+        if (ptr[0] == '-' || ptr[0] == '+') j = 1;
+        if (j >= len) {
+            rb_raise(rb_eArgError,
+                     "non-digit character in capture: %.*s",
+                     (int)len, ptr);
+        }
+        for (; j < len; j++) {
+            if (ptr[j] < '0' || ptr[j] > '9') {
+                rb_raise(rb_eArgError,
+                         "non-digit character in capture: %.*s",
+                         (int)len, ptr);
+            }
+        }
+        return rb_str_to_inum(rb_str_new(ptr, len), 10, 0);


Does this accept 1_234?
See also: #192 (comment)
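As a rough way to reason about the question, the validation loop in the hunk above can be mirrored in Ruby. This is only a sketch of the quoted C fallback branch; `strict_decimal?` is a hypothetical name, and the comparison with `String#to_i` is there to show where the two behaviors diverge.

```ruby
# Hypothetical Ruby mirror of the quoted C loop: an optional single sign
# followed by one or more ASCII digits, and nothing else.
def strict_decimal?(text)
  body = text.sub(/\A[+-]/, "")   # the C code skips one leading sign
  return false if body.empty?     # a bare sign (or empty input) is rejected
  body.each_char.all? { |ch| ch >= "0" && ch <= "9" }
end

%w[1234 +42 1_234 -].each do |text|
  puts "#{text.inspect}: strict=#{strict_decimal?(text)} to_i=#{text.to_i}"
end
```

Under this sketch an underscore is a non-digit character, so "1_234" is rejected by the loop even though `String#to_i` accepts it.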

kou · 2026-03-07T05:28:14Z

ext/strscan/strscan.c

+    GET_SCANNER(self, p);
+    if (! MATCHED_P(p))        return Qnil;
+
+    switch (TYPE(idx)) {
+        case T_SYMBOL:
+            idx = rb_sym2str(idx);
+            /* fall through */
+        case T_STRING:
+            RSTRING_GETMEM(idx, name, i);
+            i = name_to_backref_number(&(p->regs), p->regex, name, name + i, rb_enc_get(idx));
+            break;
+        default:
+            i = NUM2LONG(idx);
+    }
+
+    if (i < 0)
+        i += p->regs.num_regs;
+    if (i < 0)                 return Qnil;
+    if (i >= p->regs.num_regs) return Qnil;
+    if (p->regs.beg[i] == -1)  return Qnil;


You copied this from strscan_aref(), right? Can we share common code between strscan_aref() and strscan_integer_at()?
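The shared index-resolution logic in the hunk above could be sketched in Ruby as follows. The name `resolve_capture_index` appears later in this thread for the C helper, but the Ruby signature here (`num_regs`, and a `names` hash from group name to group number) is an assumption made for illustration; the check for groups that did not participate in the match is omitted.

```ruby
# Hypothetical Ruby rendering of the shared lookup. `names` maps a group
# name (String) to its group number; `num_regs` counts group 0 as well.
def resolve_capture_index(specifier, num_regs, names)
  index =
    case specifier
    when Symbol, String
      names.fetch(specifier.to_s) do
        raise IndexError, "undefined group name reference: #{specifier}"
      end
    else
      Integer(specifier)
    end
  index += num_regs if index.negative?   # negative indices count from the end
  return nil if index.negative? || index >= num_regs
  index
end

names = { "year" => 1, "month" => 2, "day" => 3 }
p resolve_capture_index(:day, 4, names)
p resolve_capture_index(-1, 4, names)
p resolve_capture_index(4, 4, names)
```

Both `[]` and `integer_at` could then call one helper and differ only in how they convert the resolved capture.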

kou · 2026-03-07T05:29:29Z

ext/strscan/strscan.c

+    end = adjust_register_position(p, p->regs.end[i]);
+    len = end - beg;
+
+    if (len <= 0) {


Can we use == 0 here?
Can len be negative?

kou · 2026-03-07T05:36:29Z

ext/strscan/strscan.c

+    len = end - beg;
+
+    if (len <= 0) {
+        rb_raise(rb_eArgError, "empty capture for integer conversion");


Suggested change

rb_raise(rb_eArgError, "empty capture for integer conversion");

rb_raise(rb_eArgError, "specified capture is empty: %"PRIsVALUE, idx);

kou · 2026-03-07T05:37:07Z

ext/strscan/strscan.c

+
+        if (endp != ptr + len) {
+            rb_raise(rb_eArgError,
+                     "non-digit character in capture: %.*s",


Is there any other reason for failure?

kou · 2026-03-07T05:38:15Z

ext/strscan/strscan.c

+
+        if (endp != ptr + len) {
+            rb_raise(rb_eArgError,
+                     "non-digit character in capture: %.*s",


If the target string has a trailing space, the problem is difficult to spot. How about surrounding the target string with delimiters, like the following?

Suggested change

"non-digit character in capture: %.*s",

"non-digit character in capture: <%.*s>",

kou · 2026-03-07T05:44:52Z

> Do you know why StringScanner has an interface very similar to MatchData but yet doesn't expose the MatchData object?

No. But if we create a MatchData, it causes performance overhead, right? (StringScanner doesn't use MatchData internally.) That would reduce the benefit of this optimization.

> BTW the presence of StringScanner.must_C_version makes me wonder, was StringScanner once written in Ruby?

Yes, but that was before StringScanner was imported into Ruby itself.

FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese)

eregon · 2026-03-07T09:49:18Z

> No. But if we create a MatchData, it causes performance overhead, right?

Yeah, and I guess that's the main reason StringScanner directly exposes MatchData-like methods.
StringScanner could still have a new method to return a MatchData, so MatchData methods which are not mirrored in StringScanner could be used.

> FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese)

Interesting, thank you for the link.

kou · 2026-03-07T23:00:06Z

> StringScanner could still have a new method to return a MatchData, so MatchData methods which are not mirrored in StringScanner could be used.

Yes, but that is out of scope for this pull request.

Add a method that returns a captured substring as an Integer, following String#to_i(base) semantics. Accepts an optional base argument (default 10), Symbol/String for named capture groups, and returns 0 for non-numeric or empty captures. Extract resolve_capture_index helper to share index resolution logic between StringScanner#[] and StringScanner#integer_at.

When base is 10 and the capture contains only digits (with optional sign) that fit in long, parse directly and return via LONG2NUM. This covers the Date._strptime use case without temporary String creation. All other cases fall through to rb_str_to_inum.

Provide a pure Ruby implementation using self[index].to_i(base) for JRuby and other non-CRuby platforms. The C extension version takes precedence when available.
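The semantics described in these commit messages can be sketched with the existing `StringScanner#[]` and `String#to_i`. Because `integer_at` may not exist in an installed strscan, the sketch uses a free-standing helper with the hypothetical name `integer_at_sketch`; it illustrates the stated equivalence and is not the C implementation.

```ruby
require "strscan"

# Hypothetical stand-in for the proposed method: nil when there is no
# capture, otherwise String#to_i(base) applied to the captured substring.
def integer_at_sketch(scanner, specifier, base = 10)
  scanner[specifier]&.to_i(base)
end

scanner = StringScanner.new("2024-06-15")
scanner.scan(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/)
p integer_at_sketch(scanner, :year)
p integer_at_sketch(scanner, 1, 16)
p integer_at_sketch(scanner, 4)
```

The fast path described above matters only for speed; any optimized version would be expected to return the same values as this one-liner.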

jinroq · 2026-03-22T15:42:44Z

@kou
This comment finalizes the specification for MatchData#integer_at(n). StringScanner#integer_at(n) has also been updated to comply with this specification. Please review it.

kou · 2026-03-22T21:43:44Z

ext/strscan/strscan.c

+     * This covers the Date._strptime use case. */
+    if (base == 10) {
+        long j = 0;
+        int negative = 0;


Could you use bool instead of int for boolean?

485f27c fixed it.

kou · 2026-03-22T21:49:48Z

lib/strscan/strscan.rb

+  unless method_defined?(:integer_at)
+    # Fallback implementation for platforms without C extension (e.g. JRuby).
+    # Equivalent to self[index].to_i(base).
+    def integer_at(index, base = 10)
+      str = self[index]
+      return nil if str.nil?
+      str.to_i(base)
+    end
+  end
+


Please don't split #scan_integer documentation and implementation.

ef6281d fixed it.

kou · 2026-03-22T21:52:59Z

ext/strscan/strscan.c

+                }
+            }
+            if (all_digits) {
+                if (digit_count <= (sizeof(long) >= 8 ? 18 : 9)) {


It seems that 9223372036854775807 (max int64_t value) isn't optimized. Is it intentional?

It seems that 00000000000000000001 isn't optimized. Is it intentional?

That was not my intention. 6052143 fixed it.

kou · 2026-03-22T21:54:18Z

test/strscan/test_stringscanner.rb

+
+  def test_integer_at_index_zero
+    s = create_string_scanner("42 abc")
+    s.scan(/(\d+)/)


We don't need (...) here, right?

Suggested change

s.scan(/(\d+)/)

s.scan(/\d+/)

f64fdd8 fixed it.

kou · 2026-03-22T21:54:47Z

test/strscan/test_stringscanner.rb

    assert_equal({"number" => "1"}, scan.named_captures)
  end

+  def test_integer_at


Could you use test_integer_at_XXX like other methods?

6a55781 fixed it.

kou · 2026-03-23T00:37:02Z

test/strscan/test_stringscanner.rb

+  def test_integer_at_named_capture_undefined
+    s = create_string_scanner("2024-06-15")
+    s.scan(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/)
+    assert_raise(IndexError) { s.integer_at(:unknown) }
+    assert_raise(IndexError) { s.integer_at("unknown") }


Can we use unknown for both the test name and the test value?

4d3583f fixed it.

kou · 2026-03-23T00:38:47Z

test/strscan/test_stringscanner.rb

+  def test_integer_at_underscore
+    # follows String#to_i: underscores are accepted
+    s = create_string_scanner("1_0_0")
+    s.scan(/(\d+(?:_\d+)*)/)
+    assert_equal(100, s.integer_at(1))
+  end


Can we optimize this case too?

e686711 fixed it.
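For reference, the underscore handling that an optimized path has to reproduce can be sketched in Ruby. The helper `fast_underscore_parse` and the constant `UNDERSCORED_DIGITS` are hypothetical names; the sketch accepts only fully well-formed input (single underscores between digits) and returns `nil` otherwise, which stands in for "fall back to the general parser".

```ruby
# Digits separated by single underscores, with an optional sign.
UNDERSCORED_DIGITS = /\A[+-]?[0-9]+(?:_[0-9]+)*\z/

# Hypothetical fast path: accumulate digits directly, skipping underscores,
# without building an intermediate String. Returns nil for other input.
def fast_underscore_parse(text)
  return nil unless UNDERSCORED_DIGITS.match?(text)
  value = 0
  text.each_char do |ch|
    value = value * 10 + (ch.ord - 48) if ch >= "0" && ch <= "9"
  end
  text.start_with?("-") ? -value : value
end

["1_0_0", "1_000_000", "1__0", "_1"].each do |text|
  puts "#{text.inspect}: fast=#{fast_underscore_parse(text).inspect} to_i=#{text.to_i}"
end
```

Inputs such as "1__0" or "_1" are left to `String#to_i`, which stops at the first invalid position rather than rejecting the whole string.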

kou · 2026-03-23T00:41:32Z

test/strscan/test_stringscanner.rb

+    assert_equal(999999999999999999, s.integer_at(1))
+
+    # 19 digits: exceeds long on 64-bit, becomes bignum
+    s = create_string_scanner("9999999999999999999")


In general, we should use border values for testing. If "9" * 18 is the largest optimized value, we should use "9" * 18 and "1" + "0" * 18 (the next value after "9" * 18).

0f2ad2a fixed it.
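A small Ruby sketch of the boundary-value idea: for a digit-count limit, the pair to test is the largest value with that many digits and its immediate successor, which is the smallest value with one more digit. The helper name `digit_boundary_pair` is hypothetical.

```ruby
# Hypothetical helper: returns [largest n-digit string, smallest (n+1)-digit string].
def digit_boundary_pair(digits)
  ["9" * digits, "1" + "0" * digits]
end

largest, successor = digit_boundary_pair(18)
puts "#{largest} (#{largest.length} digits)"
puts "#{successor} (#{successor.length} digits)"
puts successor.to_i - largest.to_i
```

Any other 19-digit input, such as a run of nineteen identical digits, also crosses the limit but is not adjacent to it, so it says less about off-by-one errors.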

kou · 2026-03-23T00:45:28Z

ext/strscan/strscan.c


+/*
+ * call-seq:
+ *   integer_at(index, base = 10) -> integer or nil


Could you use specifier not index like we did for []?

strscan/ext/strscan/strscan.c

Lines 1625 to 1695 in 4243751

    /*
     *
     * :markup: markdown
     * :include: strscan/link_refs.txt
     *
     * call-seq:
     * [](specifier) -> substring or nil
     *
     * Returns a captured substring or `nil`;
     * see [Captured Match Values][13].
     *
     * When there are captures:
     *
     * ```rb
     * scanner = StringScanner.new('Fri Dec 12 1975 14:39')
     * scanner.scan(/(?<wday>\w+) (?<month>\w+) (?<day>\d+) /)
     * ```
     *
     * - `specifier` zero: returns the entire matched substring:
     *
     * ```rb
     * scanner[0] # => "Fri Dec 12 "
     * scanner.pre_match # => ""
     * scanner.post_match # => "1975 14:39"
     * ```
     *
     * - `specifier` positive integer. returns the `n`th capture, or `nil` if out of range:
     *
     * ```rb
     * scanner[1] # => "Fri"
     * scanner[2] # => "Dec"
     * scanner[3] # => "12"
     * scanner[4] # => nil
     * ```
     *
     * - `specifier` negative integer. counts backward from the last subgroup:
     *
     * ```rb
     * scanner[-1] # => "12"
     * scanner[-4] # => "Fri Dec 12 "
     * scanner[-5] # => nil
     * ```
     *
     * - `specifier` symbol or string. returns the named subgroup, or `nil` if no such:
     *
     * ```rb
     * scanner[:wday] # => "Fri"
     * scanner['wday'] # => "Fri"
     * scanner[:month] # => "Dec"
     * scanner[:day] # => "12"
     * scanner[:nope] # => nil
     * ```
     *
     * When there are no captures, only `[0]` returns non-`nil`:
     *
     * ```rb
     * scanner = StringScanner.new('foobarbaz')
     * scanner.exist?(/bar/)
     * scanner[0] # => "bar"
     * scanner[1] # => nil
     * ```
     *
     * For a failed match, even `[0]` returns `nil`:
     *
     * ```rb
     * scanner.scan(/nope/) # => nil
     * scanner[0] # => nil
     * scanner[1] # => nil
     * ```
     *
     */

433fd87 fixed it.

kou · 2026-03-23T00:46:04Z

ext/strscan/strscan.c

+    VALUE idx, vbase;
+    int base = 10;
+
+    rb_scan_args(argc, argv, "11", &idx, &vbase);


Could you use specifier not idx?

433fd87 fixed it.

Skip leading zeros to compute effective digit count, allowing values like "00000000000000000001" to use the fast path. Add overflow-checked parsing for 19-digit values so LONG_MAX fits in the fast path while LONG_MAX+1 correctly falls through to rb_str_to_inum.
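The path selection this commit message describes can be sketched in Ruby. Everything named here is hypothetical (`SKETCH_LONG_MAX`, `SKETCH_SAFE_DIGITS`, `fast_path_kind`), a 64-bit `long` is assumed, and underscores are ignored to keep the sketch short; it only classifies an input, it does not parse it the way the C code does.

```ruby
SKETCH_LONG_MAX = 2**63 - 1    # assumption: 64-bit long
SKETCH_SAFE_DIGITS = 18        # any 18-digit decimal fits in a 64-bit long

# Hypothetical classifier: which branch would handle this capture?
def fast_path_kind(text)
  negative = text.start_with?("-")
  digits = text.sub(/\A[+-]/, "")
  return :slow unless digits.match?(/\A[0-9]+\z/)
  effective = digits.sub(/\A0+/, "")   # leading zeros do not count
  if effective.length <= SKETCH_SAFE_DIGITS
    :safe
  elsif effective.length == SKETCH_SAFE_DIGITS + 1
    limit = negative ? SKETCH_LONG_MAX + 1 : SKETCH_LONG_MAX
    effective.to_i <= limit ? :overflow_checked : :slow
  else
    :slow
  end
end

["00000000000000000001", "9223372036854775807", "9223372036854775808"].each do |text|
  puts "#{text}: #{fast_path_kind(text)}"
end
```

The negative limit is one larger than the positive one because the magnitude of LONG_MIN exceeds LONG_MAX by one.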

jinroq · 2026-03-31T14:07:03Z

@kou

> Could you also add a test when the current position isn't 0?

6b44dbd fixed it.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Copilot · 2026-04-01T05:17:18Z

ext/strscan/strscan.c

+            bool overflow = false;
+            for (; j < len; j++) {
+                if (ptr[j] != '_') {
+                    unsigned long d = ptr[j] - '0';
+                    result = result * 10 + d;
+                    if (result > limit) {
+                        overflow = true;
+                        break;
+                    }


In parse_decimal_fast, the overflow-check path uses unsigned long result and performs result = result * 10 + d before comparing to limit. On 32-bit platforms, unsigned long can overflow/wrap during the multiply/add for inputs > ULONG_MAX, which can make result > limit unreliable and return an incorrect (wrapped) value instead of falling back to rb_str_to_inum (e.g., parsing a 10-digit value like "9999999999"). Consider using a wider intermediate type (e.g., unsigned long long) or doing a pre-multiply overflow check (result > limit/10 etc.) before updating result.

Suggested change

-            bool overflow = false;
-            for (; j < len; j++) {
-                if (ptr[j] != '_') {
-                    unsigned long d = ptr[j] - '0';
-                    result = result * 10 + d;
-                    if (result > limit) {
-                        overflow = true;
-                        break;
-                    }
+            unsigned long limit_div10 = limit / 10;
+            unsigned long limit_mod10 = limit % 10;
+            bool overflow = false;
+            for (; j < len; j++) {
+                if (ptr[j] != '_') {
+                    unsigned long d = (unsigned long)(ptr[j] - '0');
+                    if (result > limit_div10 ||
+                        (result == limit_div10 && d > limit_mod10)) {
+                        overflow = true;
+                        break;
+                    }
+                    result = result * 10 + d;

e6e5c27 fixed it
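The wraparound concern can be reproduced in Ruby by masking the accumulator to 32 bits, which is an assumption standing in for a 32-bit `unsigned long`. Both function names are hypothetical; `nil` stands for "overflow detected, fall back to the slow path".

```ruby
MASK32 = 0xFFFF_FFFF
LONG_MAX_32 = 2**31 - 1

# Check after updating: the accumulator may already have wrapped.
def parse_check_after(digits, limit)
  result = 0
  digits.each_char do |ch|
    result = (result * 10 + (ch.ord - 48)) & MASK32   # simulated 32-bit wrap
    return nil if result > limit
  end
  result
end

# Check before updating: the accumulator never exceeds the limit.
def parse_check_before(digits, limit)
  limit_div10, limit_mod10 = limit.divmod(10)
  result = 0
  digits.each_char do |ch|
    d = ch.ord - 48
    return nil if result > limit_div10 || (result == limit_div10 && d > limit_mod10)
    result = result * 10 + d
  end
  result
end

p parse_check_after("9999999999", LONG_MAX_32)
p parse_check_before("9999999999", LONG_MAX_32)
```

With the post-update check, the 10-digit input wraps to a value below the limit and is returned as if it were valid; the pre-update check reports overflow instead.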

Copilot · 2026-04-01T05:17:18Z

ext/strscan/strscan.c

+    int base = 10;
+
+    rb_scan_args(argc, argv, "11", &specifier, &vbase);
+    if (!NIL_P(vbase)) base = NUM2INT(vbase);


integer_at currently treats an explicit nil second argument the same as omitting the argument (base remains 10 because of if (!NIL_P(vbase)) ...). This diverges from the documented equivalence to self[specifier].to_i(base), where passing nil for base raises a TypeError (and it also diverges from the pure-Ruby fallback, which would call to_i(nil) and raise). Consider changing the base handling to distinguish “argument not provided” vs “provided as nil” (e.g., check argc instead of NIL_P).

Suggested change

if (!NIL_P(vbase)) base = NUM2INT(vbase);

if (argc > 1) base = NUM2INT(vbase);

ba15508 fixed it.
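The difference between an omitted argument and an explicit `nil` can be shown in Ruby with a sentinel default, which plays the role that checking `argc` plays in C. The names `BASE_NOT_GIVEN` and `to_integer_sketch` are hypothetical.

```ruby
# Sentinel that no caller can pass by accident.
BASE_NOT_GIVEN = Object.new.freeze

# Hypothetical helper: default to base 10 only when the argument is omitted;
# an explicit nil is passed through to String#to_i, which raises TypeError.
def to_integer_sketch(text, base = BASE_NOT_GIVEN)
  base = 10 if base.equal?(BASE_NOT_GIVEN)
  text.to_i(base)
end

p to_integer_sketch("2024")
p to_integer_sketch("2024", 16)
begin
  to_integer_sketch("2024", nil)
rescue TypeError => e
  puts "TypeError: #{e.message}"
end
```

Treating `nil` as "use the default" would be a friendlier API, but it would no longer match `self[specifier].to_i(base)`.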

Copilot · 2026-04-01T05:17:18Z

test/strscan/test_stringscanner.rb

+    assert_equal(0, s.integer_at(1))       # base 10: "0xF".to_i => 0
+    assert_equal(15, s.integer_at(1, 0))   # base 0: "0xF".to_i(0) => 15
+  end
+


integer_at accepts an optional base argument, but the new tests don’t currently cover error cases for that parameter (notably base = nil, which should behave the same as String#to_i(nil) / raise TypeError if the implementation follows the stated self[specifier].to_i(base) equivalence). Adding a regression test here would help ensure the C and Ruby fallback implementations stay aligned.

Suggested change

+  def test_integer_at_base_nil
+    s = create_string_scanner("2024")
+    s.scan(/(\d+)/)
+    assert_raise(TypeError) { s.integer_at(1, nil) }
+  end

#193 (comment) fixed it.

kou · 2026-04-01T05:14:17Z

ext/strscan/strscan.c

+/*
+ * call-seq:
+ *   integer_at(specifier, base = 10) -> integer or nil
+ *
+ * Returns the captured substring at the given +specifier+ as an Integer,
+ * following the behavior of <tt>String#to_i(base)</tt>.
+ *
+ * +specifier+ can be an Integer (positive, negative, or zero), a Symbol,
+ * or a String for named capture groups.
+ *
+ * Returns +nil+ if:
+ * - No match has been performed or the last match failed
+ * - The +specifier+ is an Integer and is out of range
+ * - The group at +specifier+ did not participate in the match
+ *
+ * Raises IndexError if +specifier+ is a Symbol or String that does not
+ * correspond to a named capture group, consistent with
+ * <tt>StringScanner#[]</tt>.
+ *
+ * This is semantically equivalent to <tt>self[specifier].to_i(base)</tt>
+ * but avoids the allocation of a temporary String when possible.
+ *
+ *   scanner = StringScanner.new("2024-06-15")
+ *   scanner.scan(/(\d{4})-(\d{2})-(\d{2})/)
+ *   scanner.integer_at(1)       # => 2024
+ *   scanner.integer_at(1, 16)   # => 8228
+ *
+ */


Could you move this just before static VALUE strscan_integer_at()?

9efb4d2 fixed it.

kou · 2026-04-01T05:20:26Z

ext/strscan/strscan.c

+
+        if (effective_digits <= (sizeof(long) >= 8 ? INT64_DECIMAL_SAFE_DIGITS : INT32_DECIMAL_SAFE_DIGITS)) {
+            long result = 0;
+            for (; j < len; j++) {


Can we use for (k = first_nonzero; k < len; k++) here?

f51e1cf fixed it.

kou · 2026-04-01T05:20:39Z

ext/strscan/strscan.c

+                ? (unsigned long)LONG_MAX + 1
+                : (unsigned long)LONG_MAX;
+            bool overflow = false;
+            for (; j < len; j++) {


f51e1cf fixed it.

kou · 2026-04-01T05:23:13Z

ext/strscan/strscan.c

+static inline VALUE
+parse_decimal_fast(const char *ptr, long len)
+{
+    long j = 0;


Can we use meaningful name instead of j? I think that this is not a loop variable.

764a4a1 fixed it.

kou · 2026-04-01T05:23:39Z

ext/strscan/strscan.c

+
+    /* Validate: only digits and underscores (not leading/trailing/consecutive) */
+    {
+        long k;


Can we use i not k?

0f67657 fixed it.

kou · 2026-04-01T05:25:25Z

ext/strscan/strscan.c

+            }
+            if (!overflow) {
+                if (negative) {
+                    if (result == (unsigned long)LONG_MAX + 1)


Can we use limit here?

7f19293 fixed it.

kou · 2026-04-01T05:29:28Z

lib/strscan/strscan.rb

+      str = self[specifier]
+      return nil if str.nil?
+      str.to_i(base)


Can we use self[specifier]&.to_i(base)?

e4c6a1e fixed it.

tompng · 2026-04-03T18:59:29Z

ext/strscan/strscan.c

+            if (negative) result = -result;
+            return LONG2NUM(result);
+        }
+        /* One more digit than safe: may still fit in long with overflow check */


What do you think of combining validation/counting/safe_digits_branch/one_more_digit_branch to a single for-loop like this?

    for (k = j; k < len; k++) { // single moderate-fast path
      if (invalid) stop_parsing_or_fallback_to_slow_path;
      if (result > (limit - d) / 10) fallback_to_slow_path; // always check overflow
      result = result * 10 + d;
    }

It always checks for overflow, which adds overhead, and it performs the multiplication even when there are more than 19 digits.
But the code is simple, has fewer branches, and makes boundary testing simpler too.

Compared to this, the current code is heavily optimized to remove all the overhead that the simple for-loop has:

validation: eliminates multiplication in the invalid case

counting: eliminates useless multiplication when there are many digits

safe_digits_branch: a fast path that eliminates the overflow check

I just want to know whether the performance gain from these optimizations is really worth the added complexity.
My guess is that a simple, moderately fast path is fast enough for the use case.

If the overflow check overhead needs to be reduced, something like result > (LONG_MAX-9)/10 && result > (limit-d)/10 might be performant.
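The single-loop alternative proposed in this comment can be sketched in Ruby to check the overflow condition at the boundaries. The names `SINGLE_LOOP_LONG_MAX` and `parse_single_loop` are hypothetical, a 64-bit `long` is assumed, underscores are not handled, and `nil` stands for "fall back to the slow path".

```ruby
SINGLE_LOOP_LONG_MAX = 2**63 - 1   # assumption: 64-bit long

# Hypothetical single pass: validate, check overflow, and accumulate
# in the same loop. Returns nil when the slow path would be needed.
def parse_single_loop(text)
  negative = text.start_with?("-")
  start = text.start_with?("+", "-") ? 1 : 0
  return nil if start >= text.length
  limit = negative ? SINGLE_LOOP_LONG_MAX + 1 : SINGLE_LOOP_LONG_MAX
  result = 0
  (start...text.length).each do |k|
    ch = text[k]
    return nil unless ch >= "0" && ch <= "9"      # invalid character
    d = ch.ord - 48
    return nil if result > (limit - d) / 10       # result * 10 + d would exceed limit
    result = result * 10 + d
  end
  negative ? -result : result
end

p parse_single_loop("9223372036854775807")
p parse_single_loop("9223372036854775808")
p parse_single_loop("-9223372036854775808")
```

With integer division, `result > (limit - d) / 10` holds exactly when `result * 10 + d` would exceed `limit`, so leading zeros and the sign-dependent limit need no separate counting pass in this form.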

jinroq · 2026-04-05T08:03:56Z

@kou

I’d like to hear your thoughts on this comment.

Regarding Date._strptime, this comment makes a valid point.
Since the year is 4 digits, the month is 2 digits, and the day is 2 digits, each well within the 18-digit safe range, paths other than the safe path (such as the counting and overflow-check paths) are rarely reached in actual use.
Compared to the two-pass approach currently used in the code (validation + counting, then accumulation), a simple single loop can be expected to achieve comparable performance in most cases.

However, StringScanner#integer_at is not a feature intended solely for Date.

Since a limit must be set for inputs with more than 19 digits (as in the test_integer_at_large_number case), the limit must be determined before the negative flag is finalized, which can ultimately result in code that is at least as complex.
The current implementation correctly handles all cases involving signs, underscores, and leading zeros; if it were replaced with a simple loop, all of that handling would need to be crammed into a single loop.

I don’t think I need to respond.

kou · 2026-04-05T08:57:10Z

Could you add your comment to the thread #193 (comment) so that the discussion is easy to follow?

kou reviewed Mar 6, 2026

jinroq requested a review from kou March 6, 2026 16:43

kou reviewed Mar 7, 2026

eregon mentioned this pull request Mar 9, 2026

Implement StringScanner for TruffleRuby in pure Ruby #195

Merged

jinroq added 2 commits March 18, 2026 22:04

jinroq force-pushed the add_integer_at branch from 08a7405 to efe8d0f Compare March 18, 2026 13:44

jinroq mentioned this pull request Mar 18, 2026

Expose rb_int_parse_cstr() as public API ruby/ruby#16322

Closed

Add Ruby fallback for integer_at on platforms without C extension

d74afd1

Provide a pure Ruby implementation using self[index].to_i(base) for JRuby and other non-CRuby platforms. The C extension version takes precedence when available.

jinroq requested a review from kou March 22, 2026 15:42

kou reviewed Mar 23, 2026

jinroq added 12 commits March 23, 2026 20:48

Use bool instead of int for boolean variables in integer_at fast path

485f27c

Move integer_at fallback after scan_integer to keep docs together

ef6281d

Remove unnecessary capture group in test_integer_at_index_zero

f64fdd8

Rename test_integer_at to test_integer_at_date_parts for consistency

6a55781

Simplify regex in test_integer_at_optional_group_not_matched

67a10ae

Remove nested capture group and check group 3 directly for nil.

Rename variable huge to large in test_integer_at_large_number

9ab1172

Replace magic numbers with INT64/INT32_DECIMAL_SAFE_DIGITS constants

960130b

Remove unnecessary capture groups in test_integer_at_full_match_with_…

0065ecf

…non_digits

Remove redundant test_integer_at_full_match_with_non_digits

d89e54b

Non-digit behavior is already covered by test_integer_at_non_digit and index 0 is covered by test_integer_at_index_zero.

Rename test_integer_at_named_capture_undefined to use "unknown" consi…

4d3583f

…stently

Optimize fast path to handle underscored digit strings

e686711

Extend base-10 fast path to parse underscore-separated digits(e.g. "1_000_000") without temporary String allocation, following String#to_i underscore rules.

jinroq added 11 commits March 25, 2026 22:56

Add test for integer_at when scanner position is not at the beginning

6b44dbd

Rename parameter idx to specifier in resolve_capture_index

b6fe693

Extract base-10 fast path into parse_decimal_fast static inline function

a078c0d

Add else for readability in overflow-checked path of parse_decimal_fast

b1b135a

Simplify overflow check in parse_decimal_fast by computing before com…

db383e9

…paring

Use else if and == for boundary check in parse_decimal_fast overflow …

a421b12

…path

Rename test_integer_at_date_parts to test_integer_at_positive_index

a81aa3e

Fix boundary test to use smallest 19-digit number in test_integer_at_…

d07b086

…fixnum_bignum_boundary

Use "0" * 19 + "1" for readability in leading zeros test

1ae5772

Fix negative boundary test to use smallest 19-digit absolute value in…

04a7f84

… test_integer_at_fixnum_bignum_boundary

Split test_integer_at_fixnum_bignum_boundary into digit_count and lon…

21c6be8

…g boundary tests

jinroq requested a review from kou March 31, 2026 14:07

kou requested a review from Copilot April 1, 2026 05:11

Copilot started reviewing on behalf of kou April 1, 2026 05:12

Copilot AI reviewed Apr 1, 2026

kou reviewed Apr 1, 2026

jinroq added 6 commits April 1, 2026 21:13

Revert "Simplify overflow check in parse_decimal_fast by computing be…

e6e5c27

…fore comparing" This reverts commit db383e9.

Add comment explaining pre-check guards against 32-bit unsigned long …

d5b3651

…wraparound

Raise TypeError for explicit nil base argument in integer_at

ba15508

Move RDoc comment to immediately before strscan_integer_at

9efb4d2

Start accumulation loop from first_nonzero to skip leading zeros in p…

f51e1cf

…arse_decimal_fast

Rename j to digits_start in parse_decimal_fast for clarity

764a4a1

tompng reviewed Apr 3, 2026

jinroq added 4 commits April 5, 2026 13:11

Rename loop variable k to i in parse_decimal_fast

0f67657

Use limit variable instead of repeating LONG_MAX + 1 in LONG_MIN check

7f19293

Simplify integer_at fallback using safe navigation operator

e4c6a1e

Add braces to if/else and comment to empty capture early return in in…

9285091

…teger_at
